Method for anonymization of data collected within a mobile communication network

ABSTRACT

The invention relates to a method for anonymization of event data collected within a system or network providing a service for subscribers/customers, wherein each event data set is related to an individual subscriber/customer of the system/network and includes at least one attribute, and wherein the method counts the number of event data sets related to varying individual subscribers having identical or nearly identical values for at least one attribute. The invention further relates to a method for anonymization of static data related to individual subscribers of a mobile communication network, wherein each static data set consists of different attributes and the method identifies specific profiles derivable from the static data and drops one or more respective attributes of the static data sets and/or classifies two or more static data sets to a certain group having at least one matching attribute.

BACKGROUND OF THE INVENTION

The invention relates to a method for anonymization of event data collected within a network and to a method for anonymization of customer relation data of a mobile communication network.

Operators of arbitrary systems or networks, e.g. those applied in the banking sector, public health sector, telecommunication sector, etc., register customer related data such as personal information about their customers, contact details and optionally contract information. For instance, the data includes attributes regarding the subscriber's name, address, date of birth, bank data and many more. The collection of this data is necessary either for administration or billing purposes or to hold it available for authorities. In the following, such data is defined as customer relation data (CRM) or static data.

Furthermore, said systems/networks might continuously collect additional data during regular system/network operation. The generation of so-called event data is triggered by subscriber activity that raises a certain event within the system, or by the system itself. An event data set includes several attributes describing different properties of the triggered event, for example a timestamp, event type, etc. These event data sets are associated with a personal identifier which enables allocation of the generated event data set to an individual customer of the system/network.

One particular application of such a system is a mobile communication system which enables communication between two or more subscribers. Operators of communication systems register subscriber related data such as personal information about the subscribers, contact details and contract information. For instance, the data includes attributes regarding the subscriber's name, address, date of birth, bank data and many more. The collection of this data is necessary either for billing purposes or to hold it available for authorities. In the following, such data is defined as customer relation data (CRM) or static data.

As event data sets, network providers continuously collect additional data, referred to as location event data, during regular network operation. Each location event data set is related to a specified event of an individual subscriber. Events may be triggered by a subscriber/user, by the network or by a device; which of these triggers the event is of no importance for further processing. The data set includes several attributes such as an event attribute describing the event type, one or more location attributes identifying the geographical location where said event was triggered by the subscriber, and a timestamp defining the time of the event. These location event data sets are associated with a personal identifier which enables allocation of the location event data set to an individual subscriber of the communication system.

Due to holding this information, such systems/networks, in particular mobile communication systems, offer the possibility to provide information about the subscriber habits, in particular regarding the location data for a defined time interval. This data can be used either to create location profiles for geographical sites or to derive dynamic crowd movement patterns. In this context, the information could be useful for a wide range of applications in the area of traffic services, smart city services, infrastructure optimization services, retail insight services, security services and many more. Therefore, it is desirable to provide the generated information in suitable form to parties that benefit from applications like the aforementioned ones. Such parties could include local councils, public transport and infrastructure companies like public transport providers or electricity suppliers, retailers, major event organizers or public safety bodies, as well as many more yet unknown uses and users.

However, it is mandatory to provide this information in an anonymous manner to protect the privacy of each individual, in particular each customer/subscriber of the system or mobile communication network. Consequently, the provider of the system/mobile communication network supplying the data should only provide insights extracted from anonymized and aggregated data without disclosing personal information. Disclosure of any personal information is strictly prohibited; tracking and identifying of individuals has to be avoided under all circumstances.

A potential attacker may identify the subscriber of the generated location event data by simply observing the subscriber and an observable event which is detectable to an observing bystander due to actions of the subscriber himself. Furthermore, if too few subscribers of a mobile communication network trigger the generation of the location event data within a small geographical area, the single subscriber may be identified by said small geographical area, for instance if said area characterizes his/her place of living or work.

An additional attack scenario might be the determination of dynamic profiles from behavior patterns. Associating a plurality of dynamically occurring event data with an ID may lead to a unique event profile (e.g. an Event Location Profile). The bigger the profile and the longer the ID remains constant, the more comprehensive (sensitive) is the information that is collected with respect to a certain ID. At the same time, the probability of finding additional information (from third party sources) which enables assigning the profile to a specific individual increases. Therefore, derivation of a dynamic profile affects the disproportion between the effort for re-identification and the need for protection of the data (which increases with increasing sensitivity).

Another attack scenario is the derivation of static fingerprints from person-specific properties. If a single ID has certain properties which (individually or in combination) are unique, two effects may arise:

-   (a) The properties permit a direct reference to individuals on the basis of appropriate additional knowledge, or
-   (b) The properties themselves may constitute an identifier due to their uniqueness, wherein the identifier allows creation of full dynamic profiles despite a regular change of the ID.

SUMMARY OF THE INVENTION

It is the object of the invention to provide a method for the anonymization of data which are collected or used within an arbitrary system or network, for instance a mobile communication network, and which are each related to an individual customer/subscriber of the system/mobile communication network.

The aforementioned object is solved by a method according to the description herein. Preferred embodiments are also the subject matter of the description herein.

The invention provides a method for anonymization of event data collected within a system or network providing a service for subscribers/customers, wherein each event data set is related to an individual subscriber/customer of the system/network and includes at least one attribute, and wherein the method counts the number of event data sets related to varying individual subscribers having identical or nearly identical values for at least one attribute. It should be noted that the expression attribute is used in the sense of a specified property of the event. The attribute might stand for a certain category which may take a certain value. For instance, an event attribute can take different values defining different event types.

Each event data set consists of one or more attributes, for instance containing the time when an event took place, but may also contain information about the type of event.

Each event data set is related to an individual customer of the system, in particular by associating the data set with a personal identifier, specifically with an anonymized personal identifier.

Therefore, each event data set describes an individual event which was triggered by a specified customer of the system/network. For the purpose of storing and supplying said collected data, it is mandatory to sufficiently anonymize the data in order to avoid any identification of the individual customer.

Therefore, the inventive method identifies certain attribute values with little activity. That is to say, the inventive method counts the number of events which have the same or nearly the same values for at least one attribute and which are triggered by different customers. The lower the number of different customers triggering such events, the higher the potential risk of deanonymization; the number of different customers is therefore the significant quantity. As the number of different customers increases, the effort required to achieve a deanonymization becomes significantly higher than what is to be gained.

In a particularly preferred embodiment of the invention, the system/network is a mobile communication network and the event data refers to an event data set collected within the mobile communication network. The mobile communication network can be implemented as a mobile communication network according to the 2G, 3G or any other mobile communication standard. Additionally or alternatively, the mobile communication network relates to a wireless local area network.

Each location event data set consists of one or more attributes, at least containing the time when an event took place, but may also contain information about the place or the type of event in the mobile communication network. At least one other attribute is noted as a location attribute defining the location where the event occurred. In that case, the event data set is specified as a location event data set. Each location event data set is related to an individual subscriber of the mobile communication network, in particular by associating the data set with a personal identifier, specifically with an anonymized personal identifier.

Therefore, each location event data set describes an individual event which was triggered by a specified subscriber of the mobile communication network. For the purpose of storing and supplying said collected data, it is mandatory to sufficiently anonymize the data in order to avoid any identification of the individual subscriber.

Therefore, the inventive method identifies locations with little activity. That is to say, the inventive method counts the number of events which occur at a certain location and which are triggered by different subscribers. The lower the number of different subscribers triggering events at a location, the higher the potential risk of deanonymization; the number of different subscribers is therefore the significant quantity. As the number of different subscribers increases, the effort required to achieve a deanonymization becomes significantly higher than what is to be gained.

For the sake of convenience, the subsequent preferred aspects of the inventive method are described on the basis of a mobile communication system and location event data as a certain type of event data. However, the present invention should not be limited thereto.

The method preferably executes monitoring and counting over a certain time interval and subsequently restarts monitoring and counting in a new time interval.

In a preferred embodiment of the invention, the method discards all collected location event data sets with events occurring at a certain location if the counted number of these location event data sets related to different subscribers is less than a defined threshold after a defined time interval. The threshold is preferably defined dynamically in order to achieve a good ratio between the usability of the location event data and the sufficient anonymization of the location event data. However, said threshold may also be fixed, set according to the situation or the user's use case, or set dynamically according to other rules or event requirements.
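The discarding step can be illustrated by the following minimal sketch. It is not taken from the specification: the field names `o_pid` and `location`, the function name `discard_sparse_locations` and the sample values are assumptions chosen for illustration, and the events are assumed to have been collected within a single time interval.

```python
from collections import defaultdict


def discard_sparse_locations(events, threshold):
    """Drop every event at a location seen by fewer than `threshold` distinct subscribers."""
    # Collect the distinct (anonymized) identifiers observed per location.
    subscribers_at = defaultdict(set)
    for event in events:
        subscribers_at[event["location"]].add(event["o_pid"])
    # Keep an event only if its location reaches the threshold.
    return [e for e in events if len(subscribers_at[e["location"]]) >= threshold]


# Hypothetical events of one time interval (illustrative values only).
interval_events = [
    {"o_pid": "a", "location": "cell-1", "type": "sms"},
    {"o_pid": "b", "location": "cell-1", "type": "call"},
    {"o_pid": "c", "location": "cell-1", "type": "sms"},
    {"o_pid": "a", "location": "cell-2", "type": "sms"},
    {"o_pid": "a", "location": "cell-2", "type": "call"},
]
kept = discard_sparse_locations(interval_events, threshold=3)
```

In this example, the two events at `cell-2` stem from a single subscriber and are therefore discarded, although the raw event count at that location is two. Counting distinct subscribers rather than events is the point of the step.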

Alternatively, in a different aspect of the inventive method, all personalized information included in the collected location event data sets, in particular the anonymized personal identifier, may be deleted if the counted number is less than a defined threshold. By discarding all personalized information included in the location event data sets, it is not possible to distinguish whether a certain number of events that occurred at a certain location has been triggered by one or by more than one subscriber.

In a particularly preferred aspect of the invention, the method merges the collected location event data sets of different locations if the counted number for at least one location is less than a defined threshold. Said particularly preferred approach keeps a relationship between different data sets. In contrast to the aforementioned approach, it remains determinable whether certain events at certain locations have been triggered by different subscribers. Nevertheless, said preferred approach will guarantee that events at a certain location have been triggered by a sufficient number of different subscribers, thereby avoiding an easy deanonymization process.

Merging of the location event data sets of different locations can be performed by replacing the location attributes of said location event data sets by a generalized location attribute. The number of location event data sets related to varying individual subscribers is counted for a certain location, wherein the location can be defined as a certain geographical area. Merging of the location event data sets can be performed by increasing the geographical area and/or increasing the radius of said geographical area and/or increasing the inaccuracy of the area or of the decision whether an event occurred within said area. Furthermore, different location attributes which describe geographical areas which are adjacent to each other can be combined to a larger geographical area including both smaller areas. The location attributes of the merged location event data sets are then replaced by said larger geographical area. It is also possible to combine congeneric geographical areas which are not located adjacent to each other.

By concentrating location event data sets of adjacent locations, the number of events related to different subscribers increases. Merging of adjacent locations is to be repeated until the number of individual subscribers triggering location event data sets exceeds the defined threshold. A limited inexactness of the replaced location information is accepted in return for a more sufficient anonymization of the location event data.

It is conceivable that only location event data sets of locations with a counted number of location event data sets below the defined threshold are merged. However, if the desired number of location event data sets is not achievable by only merging said locations, it might also be possible to merge a location with a number of location event data sets below the defined threshold with a location with a number of location event data sets exceeding the defined threshold. Likewise, the merged areas may each be below the set threshold but exceed it in combination.
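The repeated merging can be sketched as follows, under the assumption that a location hierarchy is available as a mapping `parent_of` from each area to the larger area containing it. The mapping, the field names and the function name are hypothetical, and the sketch implements only the variant in which locations below the threshold are merged with each other.

```python
from collections import defaultdict


def merge_sparse_locations(events, parent_of, threshold):
    """Replace sparse locations by their enclosing area until the threshold is met.

    `parent_of` is assumed to be a finite, acyclic hierarchy; a location without
    an entry cannot be generalized any further and is left unchanged.
    """
    current = [dict(e) for e in events]  # work on copies, leave the input untouched
    while True:
        subscribers_at = defaultdict(set)
        for e in current:
            subscribers_at[e["location"]].add(e["o_pid"])
        sparse = {
            loc for loc, subs in subscribers_at.items()
            if len(subs) < threshold and loc in parent_of
        }
        if not sparse:
            return current
        # Generalize the location attribute of every event at a sparse location.
        for e in current:
            if e["location"] in sparse:
                e["location"] = parent_of[e["location"]]


# Hypothetical hierarchy: two adjacent cells belong to district-A.
parent_of = {"cell-1": "district-A", "cell-2": "district-A", "cell-3": "district-B"}
events = [
    {"o_pid": "a", "location": "cell-1"},
    {"o_pid": "b", "location": "cell-2"},
    {"o_pid": "c", "location": "cell-3"},
    {"o_pid": "d", "location": "cell-3"},
]
merged = merge_sparse_locations(events, parent_of, threshold=2)
```

Here `cell-1` and `cell-2` each have one subscriber and are merged into `district-A`, while `cell-3` already reaches the threshold and keeps its exact location. Events that remain below the threshold at the top of the hierarchy would still have to be discarded or stripped of their identifier in a separate step.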

The generation and collection of a location event data set is preferably triggered by an individual subscriber who requests a specified service of the mobile communication network. For instance, requesting a service may include the transmission of a short message (SMS, MMS), an incoming and/or outgoing telephone call, an initiation of a data session or the like. An event which is indirectly triggered by the subscriber is, for example, a known handover process, which will also generate a location event data set. A subscriber terminal may also initiate a positioning process by itself, in particular based on a GPS receiver and triggered by a smartphone app, but also using other location methods such as Wi-Fi or mobile network cells, periodical location updates by the network or active paging by the network. Further, the terminal might be passively located by the network, for instance based on triangulation. Such events may also generate respective location event data.

The location event data sets include a timestamp attribute defining the time at which the event occurred. In case of a visible event, a potential attacker could compare the event observed in the real world to the available location event data sets. If the observed event matches an available timestamp and event type within a location event data set, the included anonymized personal identifier is allocatable to a certain person (although the personal identifier itself remains anonymized/hashed). Therefore, according to a preferred aspect of the invention, the timestamp attribute is also obfuscated, especially for observable event types, but obfuscation is also possible for non-observable events.

In a preferred embodiment of the invention, the timestamp is modified either by rounding the timestamp or by adding a time offset which in turn may be randomly set. Therefore, a potential attacker cannot associate a visible event with a specified location event data set on the basis of the stored timestamp.
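Both timestamp modifications can be sketched in a few lines. The bucket width of 15 minutes, the maximum offset of 300 seconds and the function names are assumptions for illustration only; the rounding shown here rounds down to the start of the bucket and assumes a bucket width that divides 60.

```python
import secrets
from datetime import datetime, timedelta

# Random source backed by the operating system, so that offsets are not predictable.
_rng = secrets.SystemRandom()


def round_timestamp(ts, minutes=15):
    """Round `ts` down to the start of its `minutes`-wide bucket (minutes must divide 60)."""
    excess = timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )
    return ts - excess


def offset_timestamp(ts, max_offset_seconds=300):
    """Shift `ts` by a random offset within +/- `max_offset_seconds`."""
    return ts + timedelta(seconds=_rng.uniform(-max_offset_seconds, max_offset_seconds))


observed = datetime(2014, 12, 22, 13, 7, 42)
rounded = round_timestamp(observed)   # 2014-12-22 13:00:00
shifted = offset_timestamp(observed)  # somewhere between 13:02:42 and 13:12:42
```

Rounding maps every event in a bucket to the same value, whereas a random offset keeps distinct values but breaks the exact match with an observed event. Which variant is preferable depends on how the downstream statistics use the time attribute.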

In another preferred embodiment of the invention, event types can be defined which are combined into certain classes. An attacker could trigger a plurality of events of a certain type in a certain way (e.g. in a temporal pattern); for instance, a transmission of fifteen SMS is triggered, wherein each is sent at an interval of seven minutes and seven seconds from the previous one. Such a behavior pattern could be traceable in the database. By an adequate combination of certain event types into classes, such attack scenarios can be prevented. For example, all events which can be triggered by third parties (SMS reception, incoming call, etc.) can be combined into a common class, which increases the effort for an attacker to trace respective behavior patterns within the database.
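A possible way of combining event types into classes is a simple lookup table, sketched below. The event type names and class labels are hypothetical and not part of the specification.

```python
# Hypothetical mapping of concrete event types to coarser classes.
EVENT_CLASSES = {
    "sms_received": "third_party_triggered",
    "call_incoming": "third_party_triggered",
    "sms_sent": "subscriber_triggered",
    "call_outgoing": "subscriber_triggered",
}


def generalize_event_type(event):
    """Return a copy of `event` whose type is replaced by its class."""
    return {**event, "type": EVENT_CLASSES.get(event["type"], "other")}


sms = generalize_event_type({"o_pid": "a", "type": "sms_received"})
call = generalize_event_type({"o_pid": "a", "type": "call_incoming"})
```

After generalization, a received SMS and an incoming call carry the same type value, so a pattern of SMS sent by an attacker can no longer be told apart from other third-party-triggered events by type alone.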

The present invention also relates to a method for anonymization of static data related to individual subscribers of a mobile communication network with features described herein. Preferred embodiments are also the subject matter of the description herein.

According to the invention, each static data set, also noted as customer class data (CCD), consists of different attributes referring to personal information about the subscriber. The information contained in a single data set is therefore defined by the combination of the different attributes referring to personal information and other non-personal attributes. The inventive method ensures that the number of occurrences of data sets with identical personal information (i.e. an identical combination of attributes with personal information) is higher than a configurable threshold. This can be achieved by the use of an implementation of k-anonymity. The general implementation of k-anonymity was proposed by P. Samarati and L. Sweeney. In this context, reference is made to P. Samarati and L. Sweeney, “Generalizing Data to Provide Anonymity When Disclosing Information”, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of Database Systems, p. 188, 1998 (ACM).

The inventive method therefore applies k-anonymity for anonymization of static data related to individual subscribers of a mobile communication network, either by suppressing particular attribute values or by generalization of attribute values, which effectively means that the data sets are transformed to hold less specific information.

Therefore, values of at least one attribute type are replaced by general values. The selection of attributes to be generalized, i.e. classified, can be considered in a hierarchical order; for instance, classification of all attributes regarding a time is performed first. If the requirements with respect to deanonymization are not fulfilled, classification is applied to a different kind of attribute, for instance one related to location or event information. It is also possible to have a hierarchical order for classification within a certain attribute type. For instance, the inaccuracy of the location information of a location attribute is increased in steps. Location information can be given by a cell identifier and replaced by a multi-digit zip code. For further increasing the inaccuracy of said location information, the zip code may be reduced to a lower number of digits.

The process of classifying is preferably performed by replacing a certain attribute of two or more static data sets with a generalized common attribute. The step of classification is further preferably performed until the number of static data sets of each group exceeds a defined threshold.
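The stepwise generalization can be sketched for a single attribute as follows. This is a simplified illustration and not the k-anonymity implementation referred to above: it generalizes only a hypothetical `zip` attribute by masking trailing digits, treats (`gender`, `zip`) as the combination of personal attributes, requires each group to contain at least `k` data sets, and suppresses whatever cannot be made k-anonymous at the maximum level.

```python
from collections import Counter


def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a zip code, e.g. 10115 -> 1011* at level 1."""
    keep = max(len(zip_code) - level, 0)
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def k_anonymize(records, k, max_level=5):
    """Generalize the zip code step by step until every group has at least k records."""
    for level in range(max_level + 1):
        generalized = [{**r, "zip": generalize_zip(r["zip"], level)} for r in records]
        groups = Counter((r["gender"], r["zip"]) for r in generalized)
        if all(count >= k for count in groups.values()):
            return generalized, level
    # Maximum level reached: suppress the records of groups that are still too small.
    kept = [r for r in generalized if groups[(r["gender"], r["zip"])] >= k]
    return kept, max_level


# Hypothetical static data sets (illustrative values only).
static_data = [
    {"gender": "f", "zip": "10115"},
    {"gender": "f", "zip": "10117"},
    {"gender": "m", "zip": "10178"},
    {"gender": "m", "zip": "10179"},
]
anonymized, level_used = k_anonymize(static_data, k=2)
```

With `k=2`, one generalization step suffices in this example: the four data sets fall into the two groups (`f`, `1011*`) and (`m`, `1017*`), each containing two data sets.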

It is possible that at least one attribute of the static data set relates to the gender or the date of birth or the age or the profession or the place of residence of an individual subscriber of the mobile communication network.

Classification may be done by generalizing several dates of birth of different subscribers to a specified time interval including the different dates of birth. Furthermore, the attribute referring to the profession of a subscriber will be replaced by an attribute describing the industrial sector of the respective profession. For example, health professionals and nurses will be classified as persons working in the health sector.

In a preferred embodiment of the method for anonymization of static data, a subset of the total static data available at the mobile communication network is predetermined dependent on at least one criterion, wherein the anonymization method is only performed for the predetermined subset. So far, anonymization of static data has been applied to the entire customer base (static database) or to all persons in the database of the mobile network provider. However, if a subscriber shows a discrepancy from normal behaviour, for instance if the subscriber is on vacation and the actual location differs from the home location for a long time, the subscriber will be conspicuous during the filtering process, requiring a filtering out of said subscriber. Otherwise the filter outcome will be rather coarse. A limitation of the entire database to a subset of the static data will improve anonymization results; in particular, the degree of accuracy may increase.

A further preferred predetermination of a subset is based on a relevant time period and/or a relevant geographical section. For instance, the geographical section is defined as a certain location like “Berlin Alexanderplatz” and the time period is defined by a current date “22.12.2014” and a time interval “13:00-20:00”. Further, relevant attributes (e.g. “sex” and “home area”) which may be of interest for the receiving data aggregator are defined.

If the considered time window covers several days (generally short time intervals ST), a query for the frequency of the reoccurrence of a certain event would also be possible. Short-time IDs, which are valid (constant) only for a short time in order to avoid creation of dynamic profiles and which are present in an area A (e.g. the data aggregator), are transferred on request to another organizational field B (e.g. a different responsible entity, for instance a third party or the data supplier). Section B possesses the respective key for converting short-term IDs into long-term IDs by way of the MAP algorithm, as will be described with reference to the embodiments. Because various short-term IDs from A can be associated with each other at section B, it is possible to derive a statement on the reoccurrence of a certain event within a long-term time interval. However, anonymity is preserved since A only transmits IDs and section B does not receive information about the event data referring to the IDs. As a response, A only receives k-anonymous, aggregated data. Referring to the example above (“Alexanderplatz”), B does not even know that the received request from A is related to the location Alexanderplatz or to a particular time interval. A receives the aggregated k-anonymous data without further information as to which ID said aggregated data relates to. Consequently, A is only in possession of event data, while static data is located at and limited to section B. The preferred request logic enables a statement drawing on both data types without merging the data types.
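The request logic can be illustrated from the point of view of section B with the sketch below. Several simplifications are assumed: the conversion of short-term IDs into long-term IDs is represented by a plain lookup table `st_to_lt` instead of the key-based MAP conversion, the static data is a dictionary keyed by long-term ID, and all names and values are hypothetical.

```python
from collections import Counter


def handle_request(short_term_ids, st_to_lt, static_data, attributes, k):
    """Answer a request from A with aggregated counts only; no IDs are returned.

    Short-term IDs of the same subscriber collapse to one long-term ID, so each
    subscriber is counted once. Attribute combinations with fewer than k
    subscribers are suppressed.
    """
    long_term_ids = {st_to_lt[s] for s in short_term_ids if s in st_to_lt}
    profiles = Counter(
        tuple(static_data[lt][a] for a in attributes)
        for lt in long_term_ids
        if lt in static_data
    )
    return {profile: n for profile, n in profiles.items() if n >= k}


# Hypothetical data held only at section B.
st_to_lt = {"s1": "L1", "s2": "L1", "s3": "L2", "s4": "L3"}
static_data = {
    "L1": {"sex": "f", "home_area": "101**"},
    "L2": {"sex": "f", "home_area": "101**"},
    "L3": {"sex": "m", "home_area": "104**"},
}

# Section A sends only the short-term IDs it observed, without any event data.
answer = handle_request(["s1", "s2", "s3", "s4"], st_to_lt, static_data,
                        attributes=("sex", "home_area"), k=2)
```

In this example, A learns that two female subscribers with home area `101**` were present, while the single male subscriber is suppressed because his group is smaller than `k`. B never sees the location or time interval that led A to select these IDs.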

It may be conceivable that the subset is predetermined and requested by a data aggregator receiving the anonymized static data. Such a request is then sent to a module performing the anonymization process, including generalization and/or suppression, in order to achieve a sufficient level of k-anonymity. The data aggregator comprises only event-based data, for instance approved location event data received from the location event data filtering process as described in the light of the first aspect of the present invention. The data supplier comprises only static data (CRM data and event data derived from the statistics). By using different obfuscated personal identifiers in both areas, the data can be related to each other only via said request of the data aggregator.

It may be advantageous if the anonymization method (generalization/suppression) is individually configurable for each individually predetermined subset, wherein preferred configuration parameters are passed together with a request of the data aggregator. This enables a dynamic adaptation of the filter settings which is optimized for each requested subset.

For instance, configuration may include a prioritization of the attributes to be generalized/suppressed and/or a definition of a maximum/minimum hierarchy level for each attribute and/or a definition of the ratio between generalization and suppression.

For instance, prioritization provides a weighting of the relevant attributes, that is to say it defines which attribute should be generalized with higher priority. Further, the minimum level of hierarchy refers to a minimum degree of accuracy which is desired for a relevant attribute. The maximum hierarchy level of each attribute refers to a maximum allowable generalization level before the data set is suppressed instead of generalized. Lastly, a general weighting between suppression and generalization can also be defined to achieve a desired ratio.

To avoid creation of static fingerprints as indicated in the introductory part of the application, the actual static properties of an ID must always be obfuscated (generalized) in such a way that they do not undermine the criterion of disproportionality in relation to a re-identification, whether considered by themselves or in any combination with other data, especially in combination with dynamically occurring location event data.

A special feature of the present invention is the introduction of a technical workflow which, on the one hand, effectively prevents static fingerprints by integration of a request logic and, on the other hand, provides a high flexibility for each request, ensuring a maximum quality of significance for each individual request.

The invention is also related to a communication system for performing one or both of the methods according to the description herein. It is clear that the communication system is characterized by the properties and advantages of the inventive method. Therefore, a repeated description is deemed to be unnecessary.

BRIEF DESCRIPTION OF THE DRAWINGS

Further advantages and properties of the present invention are described on the basis of two embodiments shown in the figures. The figures show:

FIG. 1: an architectural overview of the system using location event data filtering according to the invention,

FIG. 2: an architectural overview of a system according to FIG. 1 additionally including static data filtering,

FIG. 3: a schematic overview of the basic method steps of the multi-level anonymization process (MAP),

FIGS. 4A, 4B and 4C: flow diagrams respectively showing the process of location event data filtering,

FIG. 5: an architectural overview of a preferred embodiment of a system according to FIG. 2 with an optional predetermination step for requesting only a subset of static data and

FIG. 6: the system of FIG. 5 extended by an additional selection and extrapolation logic.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 3 illustrates the fundamental idea of a multi-level anonymization process (MAP). The basic idea of that anonymization process relates to a data anonymization procedure to enable the usage of mass location data for big data applications with full respect of European data protection standards. Mass location data will be collected by mobile or wireless communication network providers as well as by providers collecting information based on other location technologies like GPS, Galileo, Glonass, Compass, sensor networks, etc., which additionally may possess detailed and verified personal information about their subscribers. Further, mobile network providers are able to extract location event data of the subscribers. Anonymized and aggregated information gathered by mobile network operators may provide interesting insights for different applications by third parties.

For instance, mobile network providers might provide anonymized and aggregated data to local councils, public transport companies, infrastructure companies like public transport providers or electricity suppliers, retailers, major event organizers or public safety bodies, which use such information for improving their decision-making processes, and for other yet unknown uses.

However, it is mandatory to take care of the privacy of each subscriber and of personal subscriber information. By splitting the process into several process steps which are executed within different systems, which may additionally themselves be located in different security zones with specific access rights or even in the legal premises of independent entities, the possibility of generating an allocation table between anonymized and non-anonymized identifiers is prevented.

As can be seen in FIG. 3, a data supplier referenced as DS is communicatively connected over a public or a virtual private network to a data aggregator referenced as DA. The data supplier entity DS can be any provider of movement and/or personal data. As described, DS and DA are physically separated and assigned to independent systems. Generally, DS and DA fulfill different tasks which can be assigned to different users having differing authority profiles on a common system or performed within different security zones of a common system.

The exemplary embodiments according to the figures are based on a mobile network system as a data supplier DS which provides the aforementioned data sets containing personal data as well as location event data about its subscribers. Each individual subscriber of the overall network of the DS is identified by a personal identifier PID, which might be a known identifier such as the IMSI of a subscriber. To have a real anonymization according to European data protection standards, it is inter alia necessary to have a separation of the initial PID and its counterpart, the O-PID (obfuscated personal identifier). In this context, the effort of bringing together these two identifiers has to be unreasonably high compared to the yield which could be earned by such an action. This requirement is fulfilled if the separation is realized physically within the premises of two legally independent entities, whereby one entity only knows the PID and the other one only the O-PID. However, the separation of the DS and DA can also be realized by one of the alternative possibilities as proposed above. In any case, it is necessary to encrypt and transmit the O-PID to a third party named the data aggregator DA. That personal identifier is combined into a data set with additional data attributes describing a certain location event. For instance, these event data attributes characterize an action of a subscriber at a certain place. Possible attributes are the event type, event location and timestamp. In this example, encryption is only performed for the personal identifier but can also be done for other data.

The obfuscation of the sensitive data should be realized by a multi-level anonymization process (MAP) performed at the DS to protect the user privacy. In a first step 1, a base anonymization is performed by applying a non-reversible, keyed hashing algorithm to the PID, where the key (DS-key) is only known to the data supplier DS. Said hashing algorithm should be a strong cryptographic hash function. Different DS-keys with different lifetimes, for instance ST/LT (short-time/long-time), may be available at the DS side. The output of the first method step is a single obfuscated PID referenced as O-PID. The lifetime of such an O-PID depends on the interval at which the DS-key is changed. That is to say, if the DS-key is for example constant for 24 hours, the DA will get a static obfuscated identifier for exactly that period of time. The type of DS-key used for obfuscating the PID depends on the data set/data attributes which are transmitted to the DA or third party in combination with the obfuscated PID. For instance, a short-term key (ST-key) is used for obfuscating the PID which is sent in combination with customer class data, whereas an LT-key is used for the MAP process when obfuscating the PID for transmitting location event data sets.

In a second step 2, a random component or string, preferably a multi-digit random number, is added to the output O-PID of the base anonymization procedure according to the first step 1. It is noted that the random number might be inserted at any position of the O-PID, wherein the position has to be known by the DA. It is further noted that any other randomly generated character string and any other procedure of combining the two strings might be appropriate. The length of the random number used could also be variable, but has to be known by the DA. The output of the second step is marked as O-PID+RC.

In the last step 3, a second-level encryption, which is also called additional encryption “AE”, is executed on the basis of an asymmetric encryption mechanism using the public key DA-Pub-Key of the second entity DA. The asymmetric encryption is applied to the outcome of step 2, O-PID+RC, resulting in an outcome which is marked as OO-PID. Consequently, the PID is double obfuscated to protect the user privacy.

The lifetime of the double encrypted identifier OO-PID is only dependent on the interval at which the random number used in step 2 is changed. This means that the OO-PID is constant as long as the RC is constant, which is important for calculations done on the OO-PID by a trusted partner (e.g. building of statistical indices). In contrast, the actual value of the random number is not required for decoding of the OO-PID at the DA.

Steps 1-3 are implemented in an atomic unit of work. It is impossible for the data supplier DS to read or write any information generated between the single steps. The random component used in step 2 may change on pre-determined conditions, preferably for every data set.

Steps 2 and 3 may be performed in multiple iterations, each iteration done by using a different key.

At the data aggregator side DA, decryption is executed on the additional encryption according to step 3 by using its private key DA-Priv-Key to decrypt the received encrypted identifier OO-PID. The outcome O-PID+RC will be further processed by erasing the known number of digits at the end of the string that represent the random number. The resulting outcome is the O-PID. The lifetime of this single encrypted identifier O-PID at the data aggregator side DA is defined by the interval length of the generated DS-key. If the interval length of the DS-key has elapsed, a new DS-key and therefore a new O-PID will be generated at the DS.
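By way of a non-limiting illustration, steps 1 and 2 of the MAP and the removal of the random component at the DA might be sketched as follows. The sketch assumes HMAC-SHA256 as the keyed hash, a random component of fixed length appended at the end of the string, and hypothetical function names; none of these choices is prescribed by the description. The asymmetric encryption of step 3 is deliberately left out, since it would require a cryptographic library beyond the Python standard library.

```python
import hashlib
import hmac
import secrets

# Assumption: the random component has a fixed length that is known to the DA.
RC_DIGITS = 8


def base_anonymize(pid: str, ds_key: bytes) -> str:
    """Step 1: keyed, non-reversible hash of the PID (HMAC-SHA256 is an assumption)."""
    return hmac.new(ds_key, pid.encode("utf-8"), hashlib.sha256).hexdigest()


def add_random_component(o_pid: str) -> str:
    """Step 2: append a multi-digit random number to the O-PID (yields O-PID+RC)."""
    rc = "".join(secrets.choice("0123456789") for _ in range(RC_DIGITS))
    return o_pid + rc


def strip_random_component(o_pid_rc: str) -> str:
    """DA side: erase the known number of trailing digits to recover the O-PID."""
    return o_pid_rc[:-RC_DIGITS]
```

In this sketch the same PID yields the same O-PID for as long as the DS-key is unchanged, while each call to `add_random_component` produces a different O-PID+RC string.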

The regular change of the obfuscated PID decreases the probability of dynamic profiles being created, and thus the disproportionality between the effort for reidentification and the need of protection is affected in two ways:

The original PID is only visible at the data supplier side DS since the data aggregator side DA only knows the single encrypted identifier O-PID. Therefore, it is impossible to build a catalogue (a table that assigns each non-anonymized PID to its anonymized counterpart, the O-PID) within the premises of one single party.

The outcome of the above-explained multi-level anonymization process (MAP) is that the data supplier DS is not able to find out the obfuscated PID. The same applies to the data aggregator DA, which is not able to find out the original PID on the basis of the supplied obfuscated PID.

However, as explained in the introductory part of the description, direct deanonymization is still possible for location event data sets which are triggered by visible events. A potential attacker could observe a subscriber and a visible event and assign his observation to a specified location event data set supplied by the data supplier. In that case, the anonymized subscriber, for instance the O-PID, is identified.

To avoid direct deanonymization, an additional anonymization component which refers to the inventive idea of the application is integrated into the complete anonymization process. FIG. 1 shows a possible embodiment of the present invention. It describes a technical solution for the anonymization of different data sets delivered by one single data supplier DS. Anonymization as well as the transmission of these data sets to one single data aggregator DA is processed by entirely separated processes running at the data supplier DS. The different kinds of data sets can be combined on the basis of the equal identifiers O-PID at the data aggregator DA.

The whole process is subdivided into two independent multi-level anonymization processes (MAP) where the personal identifiers PID (as unique elements between the data sets) are separately anonymized and transmitted to the data aggregator together with their respective data sets. Thereby, the first MAP process 10 is responsible for transmitting the so-called customer class data which includes attributes classifying the subscribers into different subscriber class groups, for instance gender or age groups.

The second MAP process 20 is responsible for transmitting the so-called location event data sets with attributes including the event type, a timestamp when the event occurred and the subscriber location defining the location where the event occurred. The location data set mandatorily includes at least a timestamp; further attributes such as event type and location are optional. The PID is anonymized by the base anonymization and the additional encryption, which is performed iteratively two times.

As can be seen in FIG. 1, the location event data sets in combination with their obfuscated PID are transmitted to a trusted partner TP who executes the location event data filtering 50 which refers to the inventive method of the present invention. Before the execution of the filtering process 50, the second additional encryption is undone 51. The inventive location event data filtering process will only approve those location event data sets which show a minimized risk for direct deanonymization. The approved location event data sets will then be transmitted to the data aggregator DA. The data aggregator can use the approved location event data sets for further processing.

Details of the location event data filtering process 50 according to the invention will be explained on the basis of FIGS. 4a, 4b and 4c, which show several flow diagrams of the respective subprocesses. FIG. 4a shows a block diagram of the necessary steps for undoing the additional encryption in process 51 of FIG. 1. At block 100 the trusted partner TP receives an anonymized location event data set. Due to the previous multi-level anonymization process (MAP) with k additional encryptions reversed by the DA, each location event data set does not provide any insight to the individual subscriber referenced to the location event data set. For instance, in FIG. 1 k is equal to 2. Therefore, in the second step 200 the last additional encryption of the obfuscated PID is reversed by using the private key 210. The resulting outcome 220 is the k−1 obfuscated PID including a random component. The random component is removed in block 230, resulting in a k−1 obfuscated anonymized identifier in block 240. If only one iteration of additional encryption was applied at the DS (k=1), the result will effectively be the base anonymized identifier O-PID.

The location event data set 301 including the resulting (k−1)-PID according to block 240 and the non-encrypted attributes is transferred to the filtering process 50. The filtering block starts with the so-called subprocess timestamp obfuscation 300 shown in FIG. 4b. At block 310 the subprocess checks the rule database/configuration 311 as to whether specified filtering rules exist depending on the event type of the location event data set 301. For example, block 310 checks whether the included event type of the location event data set is an observable event, which can be observed by a potential attacker. If the event type is a non-observable event type, the method will step over to block 340. Examples for an observable event type are the transmission of a short message, initiating an outgoing call, receiving an incoming call or initiating a data session. An example for a non-observable event type is a handover procedure within a mobile communication network which hands over a mobile terminal of a subscriber from a first mobile cell to an adjacent mobile cell. It should be noted that such a handover procedure is also performed during operation of a wireless local area network.

If the trusted partner discovers that the event type of the checked location event data set is an observable event type, the method proceeds with block 330. In block 330 the included timestamp attribute of the location event data set is obfuscated by manipulation. In detail, the timestamp will be modified as described below to avoid any direct deanonymization by observing the respective event. The modification of the timestamp may be done by assigning the event to a time-frame, rounding the actual timestamp or offsetting the timestamp by a certain offset, preferably determined randomly. Afterwards, the method will step over to subprocess 400 for filtering.
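The timestamp obfuscation of block 330 might, purely as an illustration, be sketched as follows. The event type names, the 15-minute time-frame and the maximum offset of 300 seconds are assumptions chosen for the example and are not specified in the description.

```python
import secrets
from datetime import datetime, timedelta

# Hypothetical names for the observable event types mentioned in the description.
OBSERVABLE_EVENT_TYPES = {"sms_sent", "call_outgoing", "call_incoming", "data_session_start"}


def assign_to_time_frame(ts: datetime, frame_minutes: int = 15) -> datetime:
    """Replace the timestamp by the start of the time-frame it falls into."""
    frame_seconds = frame_minutes * 60
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    frame_start = seconds_into_day - seconds_into_day % frame_seconds
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(seconds=frame_start)


def offset_randomly(ts: datetime, max_offset_seconds: int = 300) -> datetime:
    """Shift the timestamp by a random offset within +/- max_offset_seconds."""
    offset = secrets.randbelow(2 * max_offset_seconds + 1) - max_offset_seconds
    return ts + timedelta(seconds=offset)


def obfuscate_timestamp(event_type: str, ts: datetime) -> datetime:
    """Blocks 310/330: only timestamps of observable event types are modified."""
    if event_type in OBSERVABLE_EVENT_TYPES:
        return assign_to_time_frame(ts)
    return ts
```

Assigning to a time-frame keeps events comparable across subscribers, whereas a random offset hides the frame boundaries; which variant is preferable depends on the later use of the data.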

FIG. 4c discloses the filtering of locations with too little activity. In summary, the filtering process compares the locations of several location event data sets received within a certain monitoring time interval and counts the number of different obfuscated PIDs for each detected location. Thereby, a location list is created citing all locations and the number of different OO-PIDs which triggered an event at said location. The monitoring time interval usually corresponds to the lifetime of the first private key used at the data supplier for the single obfuscation of the PID. The filtering process can also be applied to non-obfuscated personal identifiers PID as pre-processing of data. Generally, the inventive filtering process is an independent process which can be performed at any stage or location of the described system or any other system.

In detail, at block 401 the location of the location attribute of the received location event data set is determined and compared to a location list 402. If the list already contains a matching location, the anonymous ID (k−1)-PID is added to the stored location in step 404. If the list 402 does not contain a matching entry, a new location is created as a new entry in the location list 402 and marked as “locked” according to step 403. Locations which are marked as “locked” are not approved for transmission to the DA entity, whereas locations marked as “unlocked” are approved for transmission. Further, each location is marked by its current status “locked/unlocked” together with a timestamp characterising the point of time at which a change of status occurred. Subsequently to step 403, the (k−1)-PID will be added to the new location in the list in step 404.

In the next step 405 the process checks the current status of the location to which a (k−1)-PID has just been added. If the location is currently marked as “unlocked”, the process will recheck in step 406 whether the status is still eligible. If “yes”, the respective location event data set is forwarded to the next processing stage 500 which includes the transmission to the DA. If the marking of the location is not eligible, the location is marked as “locked” in step 407 and the process continues with step 408, which is also executed when the first check in step 405 has revealed that the location is currently marked as “locked”.

Step 408 counts the number of different (k−1)-PIDs which have been added to the determined location of the location event data set within a certain time interval. The beginning of said interval is defined by the timestamp stored for each location. If the number of different (k−1)-PIDs per location does not exceed a certain threshold, the respective location event data set is added to a temporary queue 411 in step 409. If the status of these locations remains “locked” for a certain time interval, a location event filtering process is applied to these data sets at subprocess 600. The subprocess 600 will be explained later on.

If the number exceeds the threshold, the location is marked as “unlocked” together with a timestamp, and all location event data sets included in the queue 411 and referring to the counted (k−1)-PIDs are forwarded to the next processing stage 500, which is responsible for the transmission of the data sets to the data aggregator DA.
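The interplay of steps 401 to 409 may be illustrated by the following simplified sketch. It is an assumption-laden reduction: the monitoring time interval, the status timestamps and the recheck of steps 406/407 are omitted, and the class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LocationEntry:
    pids: set = field(default_factory=set)     # different (k-1)-PIDs seen at the location
    locked: bool = True                        # new locations start as "locked" (step 403)
    queue: list = field(default_factory=list)  # data sets held back (queue 411)


class LocationFilter:
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.locations = {}  # location list 402

    def process(self, pid: str, location: str, event: dict) -> list:
        """Return the data sets approved for the next processing stage 500."""
        entry = self.locations.setdefault(location, LocationEntry())  # steps 401-403
        entry.pids.add(pid)                                           # step 404
        if not entry.locked:                                          # step 405
            return [event]
        entry.queue.append(event)                                     # step 409
        if len(entry.pids) > self.threshold:                          # step 408
            entry.locked = False
            released, entry.queue = entry.queue, []
            return released
        return []
```

With a threshold of 2, for example, the first two subscribers at a location produce no output, and the event of the third subscriber releases all three queued data sets at once.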

Location event data sets referring to “locked” locations which have not been unlocked in a certain timeframe can be processed by three different options in subprocess 600.

As a first option, these location event data sets are completely discarded.

As a second option, it is possible to cancel the (k−1)-PIDs included in the location event data sets of the “locked” locations. Subsequently, location event data sets of “locked” locations are transmitted to the data aggregator without any identifying information. Indeed, the data aggregator DA can use the received location event data for further applications; however, there will be no referencing between the different location event data sets received. For instance, it is not determinable whether different data sets have been generated by one or more subscribers.

As a third option, it is also possible to combine the location event data sets of two or more adjacent locations and to accumulate their numbers of different subscribers. For example, if the list includes two locations each having a low number of different subscribers, both locations are combined with each other in order to exceed the respective threshold for the number of different subscribers.

A location can be defined as a certain geographical area. It is preferred to combine those locations which show the lowest number of subscribers and are located adjacent to each other. If the combination of non-approved locations does not achieve the defined threshold, it is also possible to combine a non-approved location with an approved location. It is mandatory for approval that the combined locations achieve the respective threshold of different subscribers. If the necessary threshold is exceeded, the location attributes of the respective location event data sets of combined locations are replaced by a common location attribute defining the combined location area. However, it is also possible to combine congeneric geographical areas which are not located adjacent to each other. Subsequently, said location event data sets are transmitted to the data aggregator DA.
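One conceivable way of combining locations is a greedy merge, sketched below under several assumptions: adjacency is given as a symmetric mapping, the smallest area at or below the threshold is always merged with its smallest adjacent area, and the function and parameter names are hypothetical. The description does not prescribe this particular strategy.

```python
def combine_locations(pids_by_location, neighbours, threshold):
    """Greedily merge areas until each one exceeds the threshold or cannot be merged.

    pids_by_location: mapping location -> set of different subscriber identifiers
    neighbours:       mapping location -> iterable of adjacent locations (symmetric)
    Returns a mapping frozenset(of original locations) -> set of identifiers.
    """
    areas = {frozenset([loc]): set(pids) for loc, pids in pids_by_location.items()}
    stuck = set()  # areas at or below the threshold that have no adjacent area
    while True:
        small = [a for a, p in areas.items() if len(p) <= threshold and a not in stuck]
        if not small:
            return areas
        # Prefer the area with the lowest number of different subscribers.
        area = min(small, key=lambda a: (len(areas[a]), sorted(a)))
        adjacent = [
            other for other in areas
            if other != area
            and any(n in other for loc in area for n in neighbours.get(loc, ()))
        ]
        if not adjacent:
            stuck.add(area)
            continue
        partner = min(adjacent, key=lambda a: (len(areas[a]), sorted(a)))
        merged_pids = areas.pop(area) | areas.pop(partner)
        areas[area | partner] = merged_pids
```

Each key of the returned mapping corresponds to one combined location area, whose members would receive a common location attribute. Areas listed without reaching the threshold (no adjacent area available) would remain non-approved.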

The step of combining can also be performed by increasing the geographical area and/or increasing the radius of said geographical area for which the number of location event data sets is counted and/or by increasing the inaccuracy of the area or the decision whether an event occurred within said area.

FIG. 2 shows an extended approach of the embodiment depicted in FIG. 1. As can be seen, in addition to the process of filtering location event data 50, filtering of customer class data/static data is performed at the trusted partner TP.

The anonymized customer class data 10 is sent to the trusted partner TP and analysed by the separated data filtering process 60. Thereby, the filtering process 60 identifies too specific profiles. If such a profile is detected, the filtering process 60 offers two different possibilities.

As a first option, it is possible to drop the whole detected profile or rather the respective location event data sets identified by the filtering process 60. As a second option, it is possible to drop or generalize just single attributes included in the customer class data. For example, if the customer class data includes attributes referring to the age or gender of a certain subscriber, it is possible to generalize the attribute age. The generalization can be performed by replacing the original attribute age including a certain age by an age interval, for instance 30 years to 40 years. Due to that manipulation of a certain attribute of the customer class group, a certain attribute group will contain more different subscribers. The increasing number of subscribers per group complicates the identification of specific subscriber profiles. Therefore, an indirect deanonymization through individual static data profiles is avoided. After a generalization of the static data profiles, customer group data is approved and sent to the data aggregator DA.

So far, the filtering is applied to the entire customer base or all persons in the data base of the mobile network provider, and filtered data is provided to the data aggregator as a whole. This may lead to rather coarse results when striving for a sufficient level of anonymity. However, if a subscriber shows a discrepancy from normal behaviour, for instance if the subscriber is on vacation and the actual location differs from the home location for a long time, the subscriber will be conspicuous during the filtering process, requiring a filtering out of said subscriber, called suppression. Since the static data filter at this point in the process is not aware of a subscriber's location, a purely static view might not be sufficient. In the light of the above, it is a preferred embodiment of this invention to provide filtered and therefore anonymized static data to the data aggregator only based on predetermination of a subset of data and not as a whole.

A predetermination of said subset is triggered by a certain request of the data aggregator which is sent to the static data filter 60, which handles the incoming request and provides filtered static data to the data aggregator according to the received request. The data aggregator comprises only event-based data, for instance approved location event data received from the location event data filtering process 50. The data supplier DS comprises only static data (CRM data and event data derived from the statistics). By using different IDs (PIDs) in both areas, the data can be related to each other only via said request of the data aggregator DA.

The general function of the static data filter 60 as described with respect to FIG. 2 remains the same. A request logic is implemented which selects a subset out of the entire data base for the static data filter 60, wherein said subset is individual for each request. This enables the filtering process to provide a “best possible data quality” while still complying with the anonymity regulations. Further, the static data filter 60 is upgraded by a configuration interface allowing an individual filter configuration for each request. Therefore, it is possible to define for each individual case the “best possible data quality” (e.g. relationship between suppression and generalization).

Such a predetermination procedure is depicted in FIG. 5. Database “location data” 70 includes event data sets for which only their corresponding short-time double obfuscated ID (ST-OO-PID) is known. In block 1 “Request Definition” at the data aggregator, a definition of a geographical and temporal section to be requested is made, for instance “Berlin Alexanderplatz” on “22.12.2014 13:00-20:00”. The time window can be defined for each request within a maximum of one year, which corresponds to the maximum long-time interval (LT). Thus, the predetermined event data sets correspond to a subset of all possible users or records from the static data. Further, the relevant attributes (e.g. “sex” and “home area”) are defined. If the considered time window covers several days (generally short-time intervals ST), a query for the frequency of the reoccurrence of a certain event is possible.

Block 1 could optionally contain the following steps. A static data filter configuration can be defined for each individual request. For instance, said configuration contains information about the weighting of the relevant attributes, that is to say which attribute should be generalized with higher priority. Further, the minimum level of hierarchy of each attribute can be set, that is to say the minimum degree of accuracy which is desired for a relevant attribute. Moreover, the maximum hierarchy level of each attribute can be determined, defining the maximum allowable generalization level before the data set is suppressed. Lastly, a general weighting between suppression and generalization can be defined.

Further, active IDs are determined within block 1.

The location abstractor 2 performs a mapping between the geographical sections defined in block 1 and the mobile cells of the mobile communication network. The mapping process might include “virtual” intermediate levels (GRID-structure) and algorithms for statistical improvement of the geographical accuracy, for instance different weighting of different geographical areas (streets or public places with high density of use vs. forests or lakes with low density of use, etc.).

The ID extractor 3 derives the relevant IDs which have been active within the defined time window and the actual (or possibly virtual) geographic section defined by the request of block 1.

In step 4 the request message is created containing a request ID, the activity period of the request associated to a considered long-time interval (LT), a list of all relevant IDs derived by the ID extractor 3, a list of relevant attributes as determined by the request definition in block 1 and optionally a static data filter configuration. Afterwards the request is forwarded to the temporal processing zone TPZ which is normally located at a trusted partner TP (FIG. 2).

At block 5 at the temporal processing zone TPZ the requested service is checked for authorization. If the authorization process is successful, a service type (here: Group Statistics Service) is chosen and the incoming request is accepted. Module 6 decrypts the transmitted IDs included in the received request by using the respective period decryption key which is valid for the activity period. The decryption process creates LT-IDs (LT-O-PID).

If a query for the frequency of the reoccurrence of a certain event within a short-time interval was defined as a relevant attribute in block 1, module “duplicates handler” 7 calculates the classes (reoccurring classes) which reoccurred within the defined time interval. The calculation is implemented by counting the frequency of reoccurrence of various short-term IDs (ST-O-PID) for each long-term ID (LT-O-PID). The respective calculated reoccurring class is assigned to each long-term ID. For instance, long-term ID X was active for a total number of four times within different short-term IDs A, B, C, D. Long-term ID X is thus assigned to the reoccurring class no. 4. The short-term IDs are irrelevant for the further process and are discarded after formation of the reoccurring classes.
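The formation of reoccurring classes can be illustrated by a short sketch. The function name and the representation of the input as pairs of long-term and short-term ID are assumptions made for the example.

```python
from collections import defaultdict


def reoccurring_classes(activity):
    """Count the different short-term IDs under which each long-term ID was active.

    activity: iterable of (long_term_id, short_term_id) pairs.
    Returns a mapping long_term_id -> reoccurring class; the short-term IDs
    themselves are not part of the result and can be discarded afterwards.
    """
    short_ids = defaultdict(set)
    for long_term_id, short_term_id in activity:
        short_ids[long_term_id].add(short_term_id)
    return {lt_id: len(st_ids) for lt_id, st_ids in short_ids.items()}
```

Applied to the example above, the pairs (X, A), (X, B), (X, C) and (X, D) assign long-term ID X to reoccurring class 4; repeated events under the same short-term ID do not increase the class.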

Further, module 7 polls the relevant attributes for each long-term ID as defined by the request from the static data database 80, which is located at the storage zone SZ (regularly at the data supplier DS). In response to the polling step, encrypted attributes are delivered to module “attributes decryption”, which decrypts the requested attribute values with the corresponding decryption key, which is independent from the key used for decryption of the short-term IDs.

The decrypted attributes are forwarded to the static data filter 60, which generates segments for each unique combination of relevant attributes as defined by the request definition of module 1. The filter performs a dynamic adjustment of the accuracy for each attribute until each segment comprises at least five long-term IDs. The adjustment is executed on the basis of defined hierarchy levels, wherein the preset minimum and/or maximum hierarchy level is considered at the filter 60. Besides a generalization of long-term records of individual IDs, it is also possible to completely discard (suppression) individual data sets. The preferred ratio between generalization (less accurate information for the largest possible number of originally queried IDs) and suppression (more details for a possibly smaller number of originally requested IDs) can optionally be preset by the data aggregator.

In case of a group statistics service, the static data filter 60 provides aggregated segment information for a minimum number of different IDs. The long-term IDs used by the static filter 60 are discarded after the filtering process.

Afterwards, a response is generated at module 5 on the basis of the filter output. The response includes the respective response ID referring to the request ID, the segment definitions and the number of long-term IDs for each segment (for example segment 673 comprises 15 IDs). The response is displayed and/or stored and transmitted to the data aggregator.

The definition and usage of the aforementioned hierarchy levels is explained in the following. The static data filter 60 dynamically combines the concrete expressions (values) of individual attributes into classes. Each class includes a specific range of values. For generalization, classes with a smaller range of values (more detailed information) can be grouped into classes with a wider range of values (less accurate information). Hierarchy levels specify how individual classes are nested within each other.

For instance, for the attribute “age” the following hierarchy levels could be determined:

Level 5: n/a
Level 4: <=49; >=50
Level 3: <=29; 30-49; >=50
Level 2: <=18; 19-29; 30-39; 40-49; 50-59; >=60
Level 1: <=18; 19-24; 25-29; 30-34; 35-39; 40-44; 45-49; 50-54; 55-59; 60-64; >=65
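The hierarchy levels listed above for the attribute “age” can be represented, for illustration only, as a simple lookup structure. The names `AGE_LEVELS` and `age_class` are hypothetical; the class boundaries are taken from the list above.

```python
# Each level lists its classes as (lowest age, highest age); None means open-ended.
AGE_LEVELS = {
    1: [(None, 18), (19, 24), (25, 29), (30, 34), (35, 39), (40, 44),
        (45, 49), (50, 54), (55, 59), (60, 64), (65, None)],
    2: [(None, 18), (19, 29), (30, 39), (40, 49), (50, 59), (60, None)],
    3: [(None, 29), (30, 49), (50, None)],
    4: [(None, 49), (50, None)],
    5: [],  # n/a: the attribute carries no information at this level
}


def _label(low, high):
    if low is None:
        return f"<={high}"
    if high is None:
        return f">={low}"
    return f"{low}-{high}"


def age_class(age: int, level: int) -> str:
    """Return the class of the given hierarchy level that contains the age."""
    for low, high in AGE_LEVELS[level]:
        if (low is None or age >= low) and (high is None or age <= high):
            return _label(low, high)
    return "n/a"
```

A subscriber aged 37 is thus reported as “35-39” at level 1, “30-39” at level 2, “30-49” at level 3, “<=49” at level 4 and “n/a” at level 5, which shows how the classes are nested within each other.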

Both classes and hierarchy levels must be fixed depending on the service and can be adjusted only in exceptional cases. By the optional specification of a minimum and/or maximum hierarchy level within a request, the accuracy/inaccuracy of a piece of information can be dynamically specified for the best case (minimum level in the hierarchy) or worst case (maximum hierarchy level).

For instance, for the attribute “age” level 2 could be defined as the minimum level in the hierarchy when information is requested regarding age groups which have an age difference of ten years. In case of a combined request for more than one attribute, a preset generalization of the attribute “age” may demand less generalization of another attribute (e.g. home area) to achieve the required minimum number of IDs for both attributes.

Furthermore, level 3 could be defined as the maximum level of the hierarchy if age information which is more inaccurate than one of the three classes <=29; 30-49; >=50 of level 3 has no relevance for the certain request. In such a case, the algorithm will not consider filtered data sets for which the age can only be generalized by level 4 or 5 to fulfil the anonymization criteria. Data sets of long-term IDs for which a generalization of the attribute age at level 4 or 5 is necessary will be discarded (suppression).
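How a minimum level, a maximum level and a tolerated amount of suppression could interact is sketched below for the single attribute “age”. This is only one possible strategy under stated assumptions: the hierarchy level is chosen globally for all records, the parameter `max_suppressed_share` is a hypothetical stand-in for the weighting between suppression and generalization, level 5 (n/a) is not supported, and the minimum segment size of five follows the example given earlier in the description.

```python
from bisect import bisect_left
from collections import Counter

# Upper class boundaries per hierarchy level; ages above the last boundary
# fall into the open-ended top class. Level 5 (n/a) is not modelled here.
AGE_UPPER_BOUNDS = {
    1: [18, 24, 29, 34, 39, 44, 49, 54, 59, 64],
    2: [18, 29, 39, 49, 59],
    3: [29, 49],
    4: [49],
}


def age_class_index(age: int, level: int) -> int:
    """Index of the class containing the age (0 is the lowest class)."""
    return bisect_left(AGE_UPPER_BOUNDS[level], age)


def filter_ages(ages_by_id, min_level, max_level, k=5, max_suppressed_share=0.0):
    """Generalize from min_level upwards; suppress records in segments smaller than k.

    Stops at the first level whose share of suppressed records is tolerated,
    otherwise returns the result for max_level.
    """
    result = {}
    for level in range(min_level, max_level + 1):
        classes = {i: age_class_index(a, level) for i, a in ages_by_id.items()}
        sizes = Counter(classes.values())
        kept = {i: c for i, c in classes.items() if sizes[c] >= k}
        suppressed = len(classes) - len(kept)
        result = {
            "level": level,
            "segments": dict(Counter(kept.values())),  # class index -> number of IDs
            "suppressed": suppressed,
        }
        if suppressed <= max_suppressed_share * len(classes):
            break
    return result
```

Only the number of long-term IDs per segment leaves the function, in line with the group statistics service; the identifiers themselves are not part of the result.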

The system according to FIG. 6 is basically the same as the system shown in FIG. 5. However, an additional selection and extrapolation engine is integrated which extrapolates the predetermined subset of static data before the filtering process. Furthermore, the embodiment of FIG. 6 provides an enhanced authentication module A which allows a more flexible setup of user rights; for instance, selection commands are allowed and a definition of the response directions is possible if the response should be directed to a third party address.

Optionally, a preselection of processed data based on selection commands defined for the request is possible due to integration of the in-memory selection engine B. For example, static data associated to IDs with home and/or work area to a certain location can be directly dropped.

The extrapolation engine extrapolates the MNO sample view to the total population based on various criteria, for instance the actual market share in different home areas within age classes.
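A simple proportional extrapolation by market share might look as follows. The description does not specify the extrapolation formula; dividing the observed count by the market share of the respective segment, as well as the function and parameter names, are assumptions made purely for illustration.

```python
def extrapolate(segment_counts, market_share):
    """Scale observed subscriber counts to the total population per segment.

    segment_counts: mapping segment -> number of the operator's own subscribers
    market_share:   mapping segment -> market share of the operator (0 < share <= 1)
    """
    return {
        segment: round(count / market_share[segment])
        for segment, count in segment_counts.items()
    }
```

For example, 15 observed subscribers in a segment where the operator holds a market share of 25 % would be extrapolated to 60 persons.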

The additional response receiver D at the third party address enables separation of request and response premises to optionally allow “circle requests”.

1-12. (canceled)
13. A method for anonymization of static data related to individual subscribers of a mobile communication network wherein each static data set consists of different attributes and the method identifies specific profiles derivable from the static data and either suppresses one or more respective attributes of the static data sets and/or classifies/generalizes two or more static data sets to a certain group having at least one matching attribute.
14. The method according to claim 13 wherein generalization/classifying is performed by replacing a certain attribute of two or more static data sets with a generalized common attribute.
15. The method according to claim 13 wherein the step of classification is performed until the number of static data sets of each group exceeds a defined threshold.
16. The method according to claim 13 wherein at least one attribute relates to the gender or the birth date or the age or profession or place of residence of an individual subscriber of the mobile communication network.
17. The method according to claim 13 wherein a subset of the total static data available at the mobile communication network is predetermined dependent on at least one criterion wherein the anonymization method is only performed for the predetermined subset.
18. The method according to claim 13 wherein predetermination for a subset is based on a relevant time period and/or relevant geographical section and further the predetermination is preferably based on location event data occurred within the relevant time period and/or at the relevant geographical section.
19. The method according to claim 13 wherein the subset is predetermined and requested by a data aggregator receiving the anonymized static data.
20. The method according to claim 13 wherein the anonymization method is individually configurable for each individually predetermined subset wherein preferable configuration parameters are passed together with a request of the data aggregator.
21. The method according to claim 20 wherein configuration of the anonymization method includes a prioritization of the attributes to be generalized/suppressed and/or a definition of a maximum/minimum hierarchy level for each attribute and/or a definition of the ratio between generalization and suppression.
 22. (canceled)